{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Collection Chat Data from Twitch Streamers\n",
    "\n",
    "Twitch chat is a rich and interesting source of text data for NLP projects, but it's not entirely obvious how to get text from their API.\n",
    "\n",
    "[Web scraping](https://learndatasci.com/tutorials/ultimate-guide-web-scraping-w-python-requests-and-beautifulsoup/) would be one option, but fortunately for us Twitch offers a way to stream chat through [IRC](https://en.wikipedia.org/wiki/Internet_Relay_Chat), which we can easily connect to using Python sockets.\n",
    "\n",
    "To stream messages from Twitch IRC you need to get a to token for authentication. To do that you need to:\n",
    "1. Create a Twitch account \n",
    "2. Go to https://twitchapps.com/tmi/ to request an auth token for your Twitch account. You'll need to click \"Connect with Twitch\" and \"Authorize\" to produce a token\n",
    "\n",
    "Your token should look something like `oauth:43rip6j6fgio8n5xly1oum1lph8ikl1` (fake for this tutorial).\n",
    "\n",
    "Including your token, there's five constants we'll define for the connection to a Twitch channel's chat feed:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "server = 'irc.chat.twitch.tv'\n",
    "port = 6667\n",
    "nickname = '<YOUR_USERNAME>'\n",
    "token = '<YOUR_TOKEN>'\n",
    "channel = '#ninja'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`channel` corresponds to the streamer's name and can be the name of any channel you're interested in. I chose Ninja because he is usually streaming every day for several hours and he has *a lot* of people watching him and chatting at once. So we rack up tons of text data quickly.\n",
    "\n",
    "We'll stream one channel at a time to start, but towards the end of the article we'll create a class and command line arguments to watch multiple channels at once. Streaming multiple channels would provide us with some neat real-time data when we have a text processor in place. \n",
    "\n",
    "## Connecting to Twitch with sockets\n",
    "To establish a connection to Twitch IRC we'll be using Python's `socket` library. First we need to instantiate a socket:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "import socket\n",
    "\n",
    "sock = socket.socket()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Next we'll connect this socket to Twitch by calling `connect()` with the `server` and `port` we defined above:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "sock.connect((server, port))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Once connected, we need to send our token and nickname for authentication, and the channel to connect to over the socket.\n",
    "\n",
    "With sockets, we need to `send()` these parameters as encoded strings:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "12"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "sock.send(f\"PASS {token}\\n\".encode('utf-8'))\n",
    "sock.send(f\"NICK {nickname}\\n\".encode('utf-8'))\n",
    "sock.send(f\"JOIN {channel}\\n\".encode('utf-8'))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`PASS` carries our token, `NICK` carries our username, and `JOIN` carries the channel. These terms are actually common among many IRC connections, not just Twitch. So you should be able to use this for other IRC you wish to connect to, but with different values.\n",
    "\n",
    "Note that we send encoded strings by calling `.encode('utf-8')`. This encodes the string into bytes which allows it to be sent over the socket.\n",
    "\n",
    "### Receiving channel messages\n",
    "\n",
    "Now we have successfully connected and can receive responses from the channel we subscribed to. To get a single response we can call `.recv()` and then decode the message from bytes:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "':spappygram!spappygram@spappygram.tmi.twitch.tv PRIVMSG #ninja :Chat, let Ninja play solos if he wants. His friends can get in contact with him.\\r\\n'"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "resp = sock.recv(2048).decode('utf-8')\n",
    "\n",
    "resp"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Note: running this the first time will show a welcome message from Twitch. Run it again to show the first message from the channel.\n",
    "\n",
    "The `2048` is the buffer size in bytes, or the amount of data to receive. The convention is to use small powers of 2, so 1024, 2048, 4096, etc. Rerunning the above will receive the next message that was pushed to the socket. \n",
    "\n",
    "If we need to close and/or reopen the socket just use:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [],
   "source": [
    "#sock.close()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Writing messages to a file\n",
    "\n",
    "Right now, our socket is being inundated with responses from Twitch but we have two problems:\n",
    "\n",
    "1. We need to continuously check for new messages\n",
    "2. We want to log the messages as they come in\n",
    "\n",
    "To fix, we'll use a loop to check for new messages while the socket is open and use Python's `logging` library to log messages to a file.\n",
    "\n",
    "First, let's set up a basic logger in Python that will write messages to a file:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [],
   "source": [
    "import logging\n",
    "\n",
    "logging.basicConfig(level=logging.DEBUG,\n",
    "                    format='%(asctime)s — %(message)s',\n",
    "                    datefmt='%Y-%m-%d_%H:%M:%S',\n",
    "                    handlers=[logging.FileHandler('chat.log', encoding='utf-8')])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We're setting the log level to DEBUG, which allows all levels of logging to be written to the file. The `format` is how we want each line to look, which will be the time we recorded the line and message from the channel separated by an em dash. The `datefmt` is how we want the time portion of the `format` to be recorded (example below). \n",
    "\n",
    "Finally, we pass a `FileHandler` to `handlers`. We could give it multiple handlers to, for example we could add another handler that prints messages to the console. In this case, we're logging to __chat.log__, which will be created by the handler. Since we're passing a plain filename without a path, the handler will create this file in the current directory. Later on we'll make this filename dynamic to create separate logs for different channels.\n",
    "\n",
    "Let's log the response we received earlier to test it out:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "logging.info(resp)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Opening __chat.log__ we can see the first message:\n",
    "\n",
    "`2018-12-10_11:26:40 — :spappygram!spappygram@spappygram.tmi.twitch.tv PRIVMSG #ninja :Chat, let Ninja play solos if he wants. His friends can get in contact with him.`\n",
    "\n",
    "So we have the time the message was logged at the beginning, a double dash separator, and then the message. This format corresponds to the `format` argument we used in `basicConfig`.\n",
    "\n",
    "Later, we'll be parsing these each message and use the time as a piece of data to explore.\n",
    "\n",
    "## Continuous message writing\n",
    "\n",
    "Now on to continuously checking for new messages in a loop.\n",
    "\n",
    "When we're connected to the socket, Twitch (and other IRC) will periodically send a keyword — \"PING\" — to check if you're still using the connection. We want to check for this keyword, and send an appropriate response — \"PONG\". \n",
    "\n",
    "One other thing we'll do is parse emojis so they can be written to a file. To do this, we'll use the [emoji](https://pypi.org/project/emoji/) library that will provide a mapping from emojis to their meaning in words. For example, if a 👍 shows up in a message it'll be converted to `:thumbs_up:`.\n",
    "\n",
    "The following is a while loop that will continuously check for new messages from the socket, send a PONG if necessary, and log messages with parsed emojis:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from emoji import demojize\n",
    "\n",
    "while True:\n",
    "    resp = sock.recv(2048).decode('utf-8')\n",
    "\n",
    "    if resp.startswith('PING'):\n",
    "        sock.send(\"PONG\\n\".encode('utf-8'))\n",
    "    \n",
    "    elif len(resp) > 0:\n",
    "        logging.info(demojize(resp))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This will keep running until you stop it. To see the messages in real-time open a new terminal, navigate to the log's location, and run `tail -f chat.log`.\n",
    "\n",
    "<img src=\"assets/ninja-stream-optimized.gif\" width=700px />"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Parsing logs\n",
    "\n",
    "Our goal for this section is to parse the chat log into a pandas DataFrame to prepare for analysis\n",
    "\n",
    "The columns we'd like to have for analysis are:\n",
    "- date and time\n",
    "- sender's username\n",
    "- and the message\n",
    "\n",
    "We'll need to parse the information from each line, so let's look at an example line again:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [],
   "source": [
    "msg = '2018-12-10_11:26:40 — :spappygram!spappygram@spappygram.tmi.twitch.tv PRIVMSG #ninja :Chat, let Ninja play solos'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can see the date is easy to extract since we know the format and can use the `datetime` library. Let's split it off and parse it:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 59,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "datetime.datetime(2018, 12, 10, 11, 26, 40)"
      ]
     },
     "execution_count": 59,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from datetime import datetime\n",
    "\n",
    "time_logged = msg.split()[0].strip()\n",
    "\n",
    "time_logged = datetime.strptime(time_logged, '%Y-%m-%d_%H:%M:%S')\n",
    "\n",
    "time_logged"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Great! We have a datetime. Let's parse the rest of the message.\n",
    "\n",
    "Since using an em dash (—, or Right-ALT+0151 on Windows) is sometimes used in chat, we will need to split on it, skip the date, and rejoin with an em dash to ensure the message is the same:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 61,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "':spappygram!spappygram@spappygram.tmi.twitch.tv PRIVMSG #ninja :Chat, let Ninja play solos'"
      ]
     },
     "execution_count": 61,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "username_message = msg.split('—')[1:]\n",
    "username_message = '—'.join(username_message).strip()\n",
    "\n",
    "username_message"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The message is structure with a username at the beginning, a '#' denoting the channel, and a colon to say where the message begins.\n",
    "\n",
    "<img src=\"assets/twitch-irc-message-structure.png\" />"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Regex is great for this kind of thing. We have three pieces of info we want to extract from a well-formatted string. \n",
    "\n",
    "In the regex search below, each parentheses  — `(.*)` — will capture that part of the string:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 56,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Channel: ninja \n",
      "Username: spappygram \n",
      "Message: Chat, let Ninja play solos\n"
     ]
    }
   ],
   "source": [
    "import re\n",
    "\n",
    "username, channel, message = re.search(':(.*)\\!.*@.*\\.tmi\\.twitch\\.tv PRIVMSG #(.*) :(.*)', username_message).groups()\n",
    "\n",
    "print(f\"Channel: {channel} \\nUsername: {username} \\nMessage: {message}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Excellent. Now we have each piece parsed. Let's loop through the entire chat log, parse each line like the example line, and create a DataFrame at the end. If you haven't used DataFrames that much, definitely check out our [beginners guide to pandas](https://www.learndatasci.com/tutorials/python-pandas-tutorial-complete-introduction-for-beginners/).\n",
    "\n",
    "Here's it all put together:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "def get_chat_dataframe(file):\n",
    "    data = []\n",
    "\n",
    "    with open(file, 'r', encoding='utf-8') as f:\n",
    "        lines = f.read().split('\\n\\n\\n')\n",
    "        \n",
    "        for line in lines:\n",
    "            try:\n",
    "                time_logged = line.split('—')[0].strip()\n",
    "                time_logged = datetime.strptime(time_logged, '%Y-%m-%d_%H:%M:%S')\n",
    "\n",
    "                username_message = line.split('—')[1:]\n",
    "                username_message = '—'.join(username_message).strip()\n",
    "\n",
    "                username, channel, message = re.search(\n",
    "                    ':(.*)\\!.*@.*\\.tmi\\.twitch\\.tv PRIVMSG #(.*) :(.*)', username_message\n",
    "                ).groups()\n",
    "\n",
    "                d = {\n",
    "                    'dt': time_logged,\n",
    "                    'channel': channel,\n",
    "                    'username': username,\n",
    "                    'message': message\n",
    "                }\n",
    "\n",
    "                data.append(d)\n",
    "            \n",
    "            except Exception:\n",
    "                pass\n",
    "            \n",
    "    return pd.DataFrame().from_records(data)\n",
    "        \n",
    "    \n",
    "df = get_chat_dataframe('chat.log')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's quickly view what we have now:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "(10591, 3)\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>channel</th>\n",
       "      <th>message</th>\n",
       "      <th>username</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>dt</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>2018-12-10 11:26:40</th>\n",
       "      <td>ninja</td>\n",
       "      <td>Chat, let Ninja play solos if he wants. His fr...</td>\n",
       "      <td>spappygram</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2018-12-10 11:27:22</th>\n",
       "      <td>ninja</td>\n",
       "      <td>!mouse</td>\n",
       "      <td>c_4rn3ge</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2018-12-10 11:27:23</th>\n",
       "      <td>ninja</td>\n",
       "      <td>!song</td>\n",
       "      <td>novaplexutube</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2018-12-10 11:27:24</th>\n",
       "      <td>ninja</td>\n",
       "      <td>https://www.shazam.com/</td>\n",
       "      <td>nightbot</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2018-12-10 11:27:28</th>\n",
       "      <td>ninja</td>\n",
       "      <td>Hm</td>\n",
       "      <td>glmc12</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                    channel  \\\n",
       "dt                            \n",
       "2018-12-10 11:26:40   ninja   \n",
       "2018-12-10 11:27:22   ninja   \n",
       "2018-12-10 11:27:23   ninja   \n",
       "2018-12-10 11:27:24   ninja   \n",
       "2018-12-10 11:27:28   ninja   \n",
       "\n",
       "                                                               message  \\\n",
       "dt                                                                       \n",
       "2018-12-10 11:26:40  Chat, let Ninja play solos if he wants. His fr...   \n",
       "2018-12-10 11:27:22                                             !mouse   \n",
       "2018-12-10 11:27:23                                              !song   \n",
       "2018-12-10 11:27:24                            https://www.shazam.com/   \n",
       "2018-12-10 11:27:28                                                 Hm   \n",
       "\n",
       "                          username  \n",
       "dt                                  \n",
       "2018-12-10 11:26:40     spappygram  \n",
       "2018-12-10 11:27:22       c_4rn3ge  \n",
       "2018-12-10 11:27:23  novaplexutube  \n",
       "2018-12-10 11:27:24       nightbot  \n",
       "2018-12-10 11:27:28         glmc12  "
      ]
     },
     "execution_count": 23,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.set_index('dt', inplace=True)\n",
    "\n",
    "print(df.shape)\n",
    "\n",
    "df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Just from streaming messages over a couple of hours we have over 10,000 rows in our DataFrame.\n",
    "\n",
    "Here's a few basic questions I'm particularly interested in:\n",
    "1. Which user commented the most during this time period?\n",
    "2. Which commands — words that start with ! — were used the most?\n",
    "3. What are the most used emotes and emojis?\n",
    "\n",
    "We'll use this dataset in the next article to explore these questions and more.\n",
    "\n",
    "## From here\n",
    "\n",
    "We've create a basic script to monitor a single channel on Twitch and successfully parsed the data into a DataFrame. \n",
    "\n",
    "There's still many improvements that can be made. For example:\n",
    "- Variable logging file for monitoring and comparing different channels\n",
    "- Use Twitch API to retrieve live and popular channels under certain games\n",
    "- Generalize script to stream chat from multiple channels at once\n",
    "\n",
    "Plus we need to answer those questions mentioned above and more. \n",
    "\n",
    "There's a ton of interesting things we could do with this data, and so if you have any ideas for interesting questions to ask, leave it in the comments below!"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.4"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
